
5.4.1 Factor variables

Factor variables automate the recoding of multi-category variables so that they can be used in a regression expression. Each category except the reference category is represented by its own dummy variable, and the estimates are interpreted relative to the reference category. To use a factor variable, put the prefix i. in front of the variable name in the regression expression; the lowest value is automatically used as the reference category.
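The recoding described above can be illustrated outside the platform. The following is a minimal Python sketch, not the platform's own implementation; the function name dummy_code and the sample values are assumptions made for illustration only.

```python
def dummy_code(values):
    """Expand a categorical column into 0/1 dummy columns.

    The lowest observed value is treated as the reference category
    and therefore gets no column of its own.
    """
    levels = sorted(set(values))
    reference, others = levels[0], levels[1:]
    columns = {
        level: [1 if v == level else 0 for v in values]
        for level in others
    }
    return reference, columns


# Hypothetical education levels for five people
edulevel = [1, 3, 2, 1, 3]
reference, dummies = dummy_code(edulevel)
print(reference)  # 1
print(dummies)    # {2: [0, 0, 1, 0, 0], 3: [0, 1, 0, 0, 1]}
```

With three categories, two dummy columns are produced. A person in the reference category has 0 in both, which is why each coefficient is read as a difference from the reference category.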

Factor variables can also be used to estimate the effects of combinations of values for categorical variables (interaction effects), in addition to the effect each explanatory variable has on its own. The rationale is that a characteristic may affect the dependent variable differently in different groups. For example, the effect of education on future income may be systematically different for men and women. In such cases, factor variables can be useful.

In regression expressions, factor variables and combinations of these are specified in the following way: The i. prefix indicates that a variable is to be interpreted as categorical. The # symbol between two variables indicates that their categories are to be combined, with one coefficient estimated for each combination that does not involve a reference category. With ##, the individual categories of each variable are also included as separate terms, in addition to the combinations.
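To make the difference between # and ## concrete, the sketch below lists the terms each operator implies for two categorical variables. It is a hedged illustration in Python, not the platform's implementation: the helper names, the sample data (three education levels, man coded 0/1), and the term labels such as 2.edulevel#1.man are all assumptions chosen for readability.

```python
from itertools import product


def non_reference_levels(values):
    """Return the sorted distinct values, dropping the lowest (the reference)."""
    return sorted(set(values))[1:]


def interaction_terms(name_a, values_a, name_b, values_b):
    """Terms implied by name_a#name_b: one per pair of non-reference levels."""
    pairs = product(non_reference_levels(values_a), non_reference_levels(values_b))
    return [f"{a}.{name_a}#{b}.{name_b}" for a, b in pairs]


def full_factorial_terms(name_a, values_a, name_b, values_b):
    """Terms implied by name_a##name_b: main effects plus interactions."""
    main_a = [f"{a}.{name_a}" for a in non_reference_levels(values_a)]
    main_b = [f"{b}.{name_b}" for b in non_reference_levels(values_b)]
    return main_a + main_b + interaction_terms(name_a, values_a, name_b, values_b)


# Hypothetical data: three education levels, man coded 0/1
edulevel = [1, 2, 3, 1, 2, 3]
man = [0, 0, 0, 1, 1, 1]

hash_terms = interaction_terms("edulevel", edulevel, "man", man)
double_hash_terms = full_factorial_terms("edulevel", edulevel, "man", man)
print(hash_terms)
# ['2.edulevel#1.man', '3.edulevel#1.man']
print(double_hash_terms)
# ['2.edulevel', '3.edulevel', '1.man', '2.edulevel#1.man', '3.edulevel#1.man']
```

Under these assumptions, ## yields the same terms as listing both variables with the i. prefix and then adding the # combination.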

Example of a linear regression analysis with income19 as the dependent variable. The independent variables are man, edulevel, and all combinations of the categories of the two variables, except those that involve a reference category:

regress income19 i.man i.edulevel edulevel#man

Result:

This alternative expression will give the same result:

regress income19 edulevel##man
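Written out, the model that both expressions fit can be sketched as follows. The notation is an assumption introduced here for illustration: man is taken to be coded 0/1, edulevel has K categories with the lowest as reference, and D_{ik} equals 1 when person i has education level k and 0 otherwise.

```latex
\text{income19}_i = \beta_0
  + \beta_1\,\text{man}_i
  + \sum_{k=2}^{K} \gamma_k\, D_{ik}
  + \sum_{k=2}^{K} \delta_k \,\bigl(D_{ik}\cdot \text{man}_i\bigr)
  + \varepsilon_i
```

Here \gamma_k is the income difference between education level k and the reference level for the reference group of man, and \delta_k is how much that difference changes for the other group.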

The c. prefix signals that a variable is to be treated as continuous (non-categorical). This can be relevant when a variable with ordered values can reasonably be interpreted as continuous, e.g. "level of education" or "age". The following expression runs a regression similar to the one above, but with education level treated as a continuous variable:

regress income19 i.man c.edulevel edulevel#man

Result: